Abstract

We describe the application of data mining algorithms to research prob- 
lems in astronomy. We posit that data mining has always been fundamen- 
tal to astronomical research, since data mining is the basis of evidence- 
based discovery, including classification, clustering, and novelty discov- 
ery. These algorithms represent a major set of computational tools for 
discovery in large databases, which will be increasingly essential in the 
era of data-intensive astronomy. Historical examples of data mining in 
astronomy are reviewed, followed by a discussion of one of the largest 
data-producing projects anticipated for the coming decade: the Large 
Synoptic Survey Telescope (LSST). To facilitate data-driven discoveries in
astronomy, we envision a new data-oriented research paradigm for astron- 
omy and astrophysics - astroinformatics. Astroinformatics is described 
as both a research approach and an educational imperative for modern 
data-intensive astronomy. An important application area for large time- 
domain sky surveys (such as LSST) is the rapid identification, charac- 
terization, and classification of real-time sky events (including moving 
objects, photometrically variable objects, and the appearance of tran- 
sients). We describe one possible implementation of a classification broker 
for such events, which incorporates several astroinformatics techniques: 
user annotation, semantic tagging, metadata markup, heterogeneous data 
integration, and distributed data mining. Examples of these types of 
collaborative classification and discovery approaches within other science 
disciplines are presented. 



1 Introduction 



It has been said that astronomers have been doing data mining for centuries: 
"the data are mine, and you cannot have them!". Seriously, astronomers are 
trained as data miners, because we are trained to: (a) characterize the known 
(i.e., unsupervised learning, clustering); (b) assign the new (i.e., supervised 
learning, classification); and (c) discover the unknown (i.e., semi-supervised 
learning, outlier detection) [121 03]. These skills are more critical than ever
since astronomy is now a data-intensive science, and it will become even more
data-intensive in the coming decade [25] [72] [9].

We describe the new data-intensive research paradigm that astronomy and 
astrophysics are now entering [331 030 HO]. This is described within the context of
the largest data-producing astronomy project in the coming decade - the LSST 
(Large Synoptic Survey Telescope). The enormous data output, database con- 
tents, knowledge discovery, and community science expected from this project 
will impose massive data challenges on the astronomical research community. 
One of these challenge areas is the rapid machine learning (ML), data mining, 
and classification of all novel astronomical events from each 3-gigapixel (6-GB) 
image obtained every 20 seconds throughout every night for the project dura- 
tion of 10 years. We describe these challenges and a particular implementation 
of a classification broker for this data fire hose. But, first, we review some of 
the prior results of applying data mining techniques in astronomical research. 
A similar, more thorough survey of data mining and ML in astronomy was 
published [7] after this paper was published.

2 Data Mining Applications in Astronomy 

Astronomers classically have focused on clustering and classification problems 
as standard practice in our research discipline. This is especially true of obser- 
vational (experimental) astronomers who collect data on objects in the sky, and 
then try to understand the objects' physical properties and hence understand 
the underlying physics that leads to those properties. This invariably leads to a 
partitioning of the objects into classes and subclasses, which reflect the manifes- 
tation of different physical processes that appear dominant in different classes 
of objects. Even theoretical astrophysicists, who apply pure physics and applied 
mathematics to astronomy problems, are usually (though not always) governed 
by the results of the experimentalists - to identify classes of behavior within 
their models, and to make predictions about further properties of those classes 
that will enhance our understanding of the underlying physics. 

2.1 Clustering 

Clustering usually has a very specific meaning to an astronomer - that is "spatial 
clustering" (more specifically, angular clustering on the sky). In other words, 
we see groupings of stars close together in the sky, which we call star clusters. 
We also see groupings of galaxies in the sky, which we call galaxy clusters (or 
clusters of galaxies). On even larger spatial scales, we see clusters of clusters 
of galaxies (superclusters) - e.g., our Milky Way galaxy belongs to the Local 
Group of Galaxies, which belongs to the Local Supercluster. Most of these clus- 
ter classes can be further subdivided and specialized: e.g., globular star clusters 
versus open star clusters; or loose groups of galaxies versus compact groups of 
galaxies; or rich clusters of galaxies versus poor clusters of galaxies. Two of the
research problems that are addressed by astronomers who study these objects 
are discovery and membership - i.e., discovering new clusters, and assigning 
objects as members of one or another cluster. These astronomical applications 
of clustering are similar to corresponding ML applications. Because clustering 
is standard research practice in astronomy, it is not possible to summarize the 
published work in this area, since it would comprise a significant fraction of 
all research papers published in all astronomy journals and conference proceed- 
ings over the last century. Specific data mining applications of clustering in 
astronomy include the search for rare and new types of objects [31] [32] [35].

More generally, particularly for the ML community, clustering refers to class 
discovery and segregation within any parameter space (not just spatial
clustering). Astronomers also perform this general type of clustering [301 ESS]. For
example, there are many objects in the Universe for which at least two classes 
have been discovered. Astronomers have not been too creative in labeling these 
classes, which include: Types I and II supernovae, Types I and II Cepheid vari- 
able stars, Populations I and II (and maybe III) stars, Types I and II active 
galaxies, and so on, including further refinement into subclasses for some of 
these. These observationally different types of objects (segregated classes) were 
discovered when astronomers noticed clustering in various parameter spaces 
(i.e., in scatter plots of measured scientific parameters). 
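
As a minimal sketch of clustering in a generic parameter space (rather than on the sky), the following Python fragment runs a plain k-means iteration on a handful of invented two-parameter measurements. The data values, the choice of k-means, and the hand-picked starting centers are all illustrative assumptions; none of them comes from the studies cited above.

```python
import math

def kmeans(points, centers, n_iter=20):
    """Plain k-means: alternate nearest-center assignment and center update."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest current center for each point.
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers

# Invented (parameter_1, parameter_2) measurements forming two loose groups.
points = [(0.1, 1.0), (0.2, 1.2), (0.0, 0.9), (2.0, 3.1), (2.2, 2.9), (1.9, 3.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels, centers)
```

The two recovered groups play the role of the "Type I" and "Type II" classes in the examples above: the algorithm is told only how many classes to look for, not which objects belong to which.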

2.2 Classification 

The other major dimension of astronomical research is the assignment of objects 
to classes. This was historically carried out one-at-a-time, as the data were col- 
lected one object at a time. ML and data mining classification algorithms were 
not explicitly necessary. However, in fact, the process is the same in astronomy 
as in data mining: (1) class discovery (clustering); (2) discover rules for the 
different classes (e.g., regions of parameter space); (3) build training samples 
to refine the rules; (4) assign new objects to known classes using new measured 
science data for those objects. Hence, it is accurate to say that astronomers 
have been data mining for centuries. Classification is a primary feature of as- 
tronomical research. We are essentially zoologists - we classify objects in the 
astronomical zoo. 

As the data sets have grown in size, it has become increasingly appropriate 
and even imperative to apply ML algorithms to the data in order to learn the 
rules and to apply the rules of classification. Algorithms that have been used 
include Bayesian analysis, decision trees, neural networks, and (more recently) 
support vector machines. The ClassX project has used a network of classifiers 
in order to estimate classes of X-ray sources using distributed astronomical data 
collections [70] [71].

We will now briefly summarize some specific examples of these. But first we 
present a more general survey of data mining research in astronomy. 



2.3 General Survey of Astronomical Data Mining 



A search of the online astronomical literature database ADS (NASA's Astro- 
physics Data System) lists only 63 refereed astronomy research papers (765 
abstracts of all types - refereed and unrefereed) that have the words "data 
mining" or "machine learning" in their abstracts. (Note that ADS searches a 
much broader set of disciplines than just astronomy when non-refereed papers 
are included - most of these search results are harvested from the ArXiv.org 
manuscript repository.) Of course, there are many fine papers related to as- 
tronomical data mining in the SIAM, ACM, IEEE, and other journals and 
proceedings that are not harvested by ADS. 

Within the ADS list of refereed papers, the earliest examples that explicitly 
refer to "data mining" in their abstract are two papers that appeared in 1997 
- these were general perspective papers. (Note that there are many papers, 
including [TJ] and [T3] that were not in the refereed literature but that pre-date 
the 1997 papers.) The first of the refereed data mining application papers that 
explicitly mentions "data mining" in the abstract and that focused on a specific 
astronomy research problem appeared in 2000 [58]. This paper described all-sky
monitoring and techniques to detect millions of variable (transient) astronomical
phenomena of all types. This was an excellent precursor study to the LSST (see
§3).

Among the most recent examples of refereed papers in ADS that explicitly 
refer to data mining (not including this author's work [17]) is the paper [47] 
that addresses the same research problem as [58]: the automated classification 
of large numbers of transient and variable objects. Again, this research is a 
major contributor and precursor to the LSST research agenda (§4.3).

A very recent "data mining" paper focuses on automatic prediction of Solar 
CMEs (Coronal Mass Ejections), which lead to energetic particle events around 
the Earth-Moon-Mars environment, which are hazardous to astronauts outside 
the protective shield of the Earth's magnetosphere [51] . This is similar to the 
data mining research project just beginning at George Mason University with 
this author [55].

Additional recent work includes investigations into robust ML for terascale 
astronomical datasets [5].

In addition to these papers, several astronomy-specific data mining projects 
are underway. These include AstroWeka (http://astroweka.sourceforge.net/), 
Grist (Grid Data Mining for Astronomy; http://grist.caltech.edu), the Laboratory
for Cosmological Data Mining (http://lcdm.astro.uiuc.edu/), the LSST
Data Mining Research study group (PS]), the Transient Classification Project
at Berkeley [11], and the soon-to-be commissioned Palomar Transient Factory. 

We will now look at more specific astronomical applications that employed 
ML and data mining techniques. We have not covered everything: other
methods that have been applied to astronomical data mining include principal
component analysis, kernel regression, random forests, and various
nearest-neighbor methods (e.g., [80] [30] [36] [21] [27] [6]).



2.3.1 Bayesian Analysis 

A search of ADS lists 575 refereed papers in which the words Bayes or Bayesian 
appear in the paper's abstract. For comparison, the same search criteria
returned 2313 abstracts of all papers (refereed and non-refereed). Seven of the
refereed papers were published before 1980 (none published before 1970). One
of these was by Sebok [67]. He applied Bayesian probability analysis to the
most basic astronomical classification problem - distinguishing galaxies from 
stars among the many thousands of objects detected in large images. This is 
a critical problem in astronomy, since the study of stars is a vastly different 
astrophysics regime than the study of galaxies. To know which objects in the 
image are stars, and hence which objects are galaxies, is critical to the science. 
It may seem that this is an obvious distinction, but that is only true for nearby 
galaxies, which appear large on the sky (with large angular extent). This is not 
true at all for very distant galaxies, which provide the most critical information
about the origin and history of our Universe. These distant galaxies appear 
as small blobs on images, almost indistinguishable from stars. Nearly 100% of
the stars are in our Milky Way Galaxy, hence very nearby (by astronomical
standards), and consequently stars carry much less cosmological
significance.
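
To make the Bayesian reasoning behind star-galaxy separation concrete, here is a minimal sketch that applies Bayes' theorem to a single invented feature: the measured width of an object divided by the width of the point-spread function (about 1 for an unresolved star, larger for a resolved galaxy). The Gaussian likelihoods, their parameters, and the function names are assumptions made for illustration; this is not the procedure of [67].

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Normal probability density at x."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def p_galaxy(size_ratio, prior_galaxy=0.5, star=(1.0, 0.1), galaxy=(1.6, 0.4)):
    """Posterior P(galaxy | size_ratio) from Bayes' theorem.

    `star` and `galaxy` are hypothetical (mean, sigma) pairs describing how the
    size ratio is distributed within each class.
    """
    like_star = gaussian_pdf(size_ratio, *star)
    like_galaxy = gaussian_pdf(size_ratio, *galaxy)
    evidence = prior_galaxy * like_galaxy + (1.0 - prior_galaxy) * like_star
    return prior_galaxy * like_galaxy / evidence

for ratio in (1.0, 1.3, 2.0):
    print(f"size ratio {ratio:.1f}: P(galaxy) = {p_galaxy(ratio):.3f}")
```

The hard cases are exactly the ones described above: for small, distant galaxies the size ratio approaches 1, the two likelihoods overlap, and the posterior is driven largely by the prior.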

A more recent example is the application of Bayesian analysis to the problem 
of star formation in young galaxies [45]: the authors applied a Bayesian Markov
Chain Monte Carlo method to determine whether the stars in the galaxies form
in one monolithic collapse of a giant gas cloud, or if they form in a hierarchical
fashion (with stars forming in smaller galaxies first, and those galaxies then
merging to become larger galaxies, and so on). The latter seems to be the best
model to fit the observational data. 

The above examples illustrate a very important point. The large number of 
papers that refer to Bayesian analysis does not indicate the number that are doing
data mining. This is because Bayesian analysis is used primarily as a statistical 
analysis technique or as a probability density estimation technique. The latter 
is certainly applicable to classification problems, but not on a grand scale as we 
expect for data mining (i.e., discovering hidden knowledge contained in large 
databases). 

One significant recent paper that applies Bayesian analysis in a data mining 
sense focuses on a very important problem in large-database astronomy:
cross-identification of astronomical sources in multiple large data collections [26]. In
order to match the same object across multiple catalogs, the authors proposed
using not just spatial coincidence but also numerous
physical properties, including colors, redshift (distance), and luminosity. The
result is an efficient algorithm that is ready for petascale astronomical data 
mining. 
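
The positional part of such a cross-identification can be sketched as a Bayes factor that compares the hypothesis "both detections are the same source" with "the detections are unrelated". The fragment below uses a small-angle, two-catalog form with circular Gaussian position errors; whether this matches the exact formulation of [26] is an assumption, and the function names, the error values, and the prior are invented for illustration.

```python
import math

ARCSEC = math.radians(1.0 / 3600.0)  # one arcsecond in radians

def bayes_factor(sep_arcsec, sigma1_arcsec, sigma2_arcsec):
    """Positional Bayes factor for 'same source' versus 'unrelated sources'.

    Small-angle form for two catalogs with circular Gaussian position errors.
    """
    var = (sigma1_arcsec * ARCSEC) ** 2 + (sigma2_arcsec * ARCSEC) ** 2
    psi = sep_arcsec * ARCSEC
    return (2.0 / var) * math.exp(-psi ** 2 / (2.0 * var))

def match_probability(bf, prior):
    """Posterior probability of a true match: posterior odds = bf * prior odds."""
    return bf * prior / (bf * prior + (1.0 - prior))

prior = 1e-9  # hypothetical prior; in practice it depends on the source densities
for sep in (0.05, 0.5, 5.0):
    bf = bayes_factor(sep, 0.1, 0.1)
    print(f"separation {sep} arcsec: P(match) = {match_probability(bf, prior):.3g}")
```

Physical properties such as colors or redshift would enter as additional multiplicative factors on the same odds, which is what lets the method go beyond spatial coincidence alone.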



2.3.2 Decision Trees 



A search of ADS lists 21 refereed papers (166 abstracts of all types) in which
"Decision Tree" appears in the abstract. One of the earliest (non-refereed)
conference papers was the 1994 paper by Djorgovski, Weir, and Fayyad [33], when
Fayyad was working at NASA's Jet Propulsion Lab. This paper described the 
SKICAT classification system, which was the standard example of astronomical 
data mining quoted in many data mining conference talks subsequently. The 
earliest paper we could find in astronomy was in 1975 [33], 16 years before the
next paper appeared. The 1975 paper addressed a "new methodology to inte- 
grate planetary quarantine requirements into mission planning, with application 
to a Jupiter orbiter". 

Decision trees have been applied to another critical research problem in
astronomy by [66] - the identification of cosmic ray (particle radiation)
contamination in astronomical images. Charge-coupled device (CCD) cameras not
only make excellent light detectors, they also detect high-energy particles that 
permeate space. Cosmic-ray particles deposit their energy and create spikes in 
CCD images (in the same way that a light photon does). The cosmic-ray hits 
are random (as the particles enter the detector randomly from ambient space) 
- they have nothing to do with the image. Understanding the characteristics 
of these bogus "events" (background noise) in astronomical images and being 
able to remove them are very important steps in astronomical image processing. 
The decision tree classifiers employed by [66] produced 95% accuracy. Recently, 
researchers have started to investigate the application of neural networks to the 
same problem [78] (and others). 
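
The building block of a decision tree is a single threshold test on one feature. As a minimal sketch, the fragment below learns one such test (a "decision stump") that separates cosmic-ray hits from real sources using an invented "sharpness" feature, on the assumption that cosmic-ray spikes are sharper than starlight blurred by the atmosphere and optics. The feature, the data, and the function name are hypothetical and are not taken from [66].

```python
def fit_stump(values, labels):
    """Find the threshold t maximizing accuracy of the rule: value > t means True."""
    best_t, best_acc = None, -1.0
    candidates = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct feature values.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        acc = sum((v > t) == lab for v, lab in zip(values, labels)) / len(values)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Invented training sample: sharpness of each detection and its true label.
sharpness = [0.2, 0.3, 0.35, 0.4, 0.8, 0.9, 0.95]
is_cosmic_ray = [False, False, False, False, True, True, True]

threshold, accuracy = fit_stump(sharpness, is_cosmic_ray)
print(f"rule: sharpness > {threshold:.2f} means cosmic ray (training accuracy {accuracy:.0%})")
```

A full decision tree applies the same search recursively to each of the two resulting subsets, possibly choosing a different feature at every split.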

2.3.3 Neural Networks 

A search of ADS lists 418 refereed papers in which the phrases "Neural Net" 
or "Neural Network" appear in the paper's abstract. For comparison, the same 
search criteria returned over 10,000 abstracts of all papers (refereed and
non-refereed, most of which are not in astronomy; see §2.3). The earliest of these
refereed papers that appeared in an astronomical journal [44] (published in 1986)
addressed neural networks and simulated annealing algorithms in general. One 
of the first real astronomical examples that was presented in a refereed paper [2] 
applied a neural network to the problem of rapid adaptive mirror adjustments 
in telescopes in order to dramatically improve image quality. 

As mentioned above, artificial neural networks (ANN) have been applied to 
the problem of cosmic-ray detection in CCD images. ANN have also been applied
to another important problem mentioned earlier: star-galaxy discrimination
(classification) in large images. Many authors have applied ANN to this
problem, including [54] [55] 08] [H [29] [59] [62]. Of course, this astronomy research
problem has been tackled by many algorithms, including decision trees [4]. 

Two other problems that have received a lot of astronomical research attention
using neural networks are: (a) the classification of different galaxy types
within large databases of galaxy data (e.g., [68] [53] [40] [3]); and (b) the
determination of the photometric redshift estimate, which is used as an approximator
of distance for huge numbers of galaxies, for which accurate distances are not
known (e.g., [37] HH1 H2 HZ]). The latter problem has also been investigated
recently using random forests [27] and support vector machines.

2.3.4 Support Vector Machines (SVM) 

ADS lists 154 abstracts (refereed and non-refereed) that include the phrase
"Support Vector Machine", of which 21 are refereed astronomy journal
papers. Three of the latter focus on the problem mentioned earlier: determination
of the photometric redshift estimate for distant galaxies [7SJ [75] HI). Note
that [77] also applies a kernel regression method to the problem - the authors 
find that kernel regression is slightly more accurate than SVM, but they discuss 
the positives and negatives of the two methods. SVM was used in conjunction 
with a variety of other methods to address the problem of cross-identification 
of astronomical sources in multiple data collections that was described earlier 
[64] [65]. SVM has also been used by several authors for forecasting solar flares
and solar wind-induced geostorms, including [3"5I 1551 ICT].
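
Since kernel regression is mentioned above as a competitor to SVM for photometric redshifts, a minimal sketch of the idea may help. The fragment below implements a Nadaraya-Watson estimator: the redshift of a new galaxy is a weighted mean of the known redshifts of training galaxies, with weights that fall off with distance in color space. The colors, redshifts, Gaussian kernel, and bandwidth are invented for illustration and are not the configuration used in [77].

```python
import math

def kernel_regression(query, train_x, train_z, bandwidth=0.2):
    """Nadaraya-Watson estimate: Gaussian-weighted mean of the training redshifts."""
    weights = [math.exp(-0.5 * (math.dist(query, x) / bandwidth) ** 2) for x in train_x]
    return sum(w * z for w, z in zip(weights, train_z)) / sum(weights)

# Invented training set: two photometric colors per galaxy and a measured redshift.
train_colors = [(0.3, 0.1), (0.5, 0.2), (0.9, 0.4), (1.2, 0.6)]
train_redshift = [0.05, 0.10, 0.30, 0.50]

z_est = kernel_regression((0.5, 0.2), train_colors, train_redshift)
print(f"estimated redshift: {z_est:.3f}")
```

The bandwidth controls the trade-off between smoothing over noise and resolving real structure in the color-redshift relation; in practice it would be tuned by cross-validation on galaxies with known redshifts.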

3 Data-Intensive Science 

The development of models to describe and understand scientific phenomena 
has historically proceeded at a pace driven by new data. The more we know, 
the more we are driven to tweak or to revolutionize our models, thereby ad- 
vancing our scientific understanding. This data-driven modeling and discovery 
linkage has entered a new paradigm [49]. The acquisition of scientific data in
all disciplines is now accelerating and causing a nearly insurmountable data
avalanche [10]. In astronomy in particular, rapid advances in three technology
areas (telescopes, detectors, and computation) have continued unabated - all of
these advances lead to more and more data [9]. With this accelerated advance in
data generation capabilities, humans will require novel, increasingly automated,
and increasingly more effective scientific knowledge discovery systems [16].

To meet the data-intensive research challenge, the astronomical research 
community has embarked on a grand information technology program, to de- 
scribe and unify all astronomical data resources worldwide. This global inter- 
operable virtual data system is referred to as the National Virtual Observatory 
(NVO, at www.us-vo.org) in the U.S., or more simply the "Virtual Observatory" 
(VO). Within the international research community, the VO effort is steered by 
the International Virtual Observatory Alliance (IVOA at www.ivoa.net). This 
grand vision encompasses more than a collection of data sets. The result is a 
significant evolution in the way that astrophysical research, both observational 
and theoretical, is conducted in the new millennium [51]. This revolution is
leading to an entirely new branch of astrophysics research - Astroinformatics
- still in its infancy, consequently requiring further research and development
as a discipline in order to aid in the data-intensive astronomical science that is
emerging [22].

The VO effort enables discovery, access, and integration of data, tools, and 
information resources across all observatories, archives, data centers, and
individual projects worldwide [60]. However, it remains outside the scope of the
VO projects to generate new knowledge, new models, and new scientific un- 
derstanding from the huge data volumes flowing from the largest sky survey 
projects [8, 9]. Even further beyond the scope of the VO is the ensuing feedback 
and impact of the potentially exponential growth in new scientific knowledge 
discoveries back onto those telescope instrument operations. In addition, while 
the VO projects are productive science-enabling I.T. research and development 
projects, they are not specifically scientific (astronomical) research projects. 
There is still enormous room for scientific data portals and data-intensive sci- 
ence research tools that integrate, mine, and discover new knowledge from the 
vast distributed data repositories that are now VO-accessible [HI [19].

The problem therefore is this: astronomy researchers will soon (if not al- 
ready) lose the ability to keep up with any of these things: the data flood, the 
scientific discoveries buried within, the development of new models of those phe- 
nomena, and the resulting new data-driven follow-up observing strategies that 
are imposed on telescope facilities to collect new data needed to validate and 
augment new discoveries. 

4 Astronomy Sky Surveys as Data Producers 

A common feature of modern astronomical sky surveys is that they are pro- 
ducing massive (terabyte) databases. New surveys may produce hundreds of 
terabytes (TB) up to 100 (or more) petabytes (PB) both in the image data 
archive and in the object catalogs (databases). Interpreting these petabyte cat- 
alogs (i.e., mining the databases for new scientific knowledge) will require more 
sophisticated algorithms and networks that discover, integrate, and learn from 
distributed petascale databases more effectively [42], [46].

4.1 The LSST Sky Survey Database 

One of the most impressive astronomical sky surveys being planned for the next 
decade is the Large Synoptic Survey Telescope project (LSST at www.lsst.org) 
[73]. The three fundamental distinguishing astronomical attributes of the LSST
project are: 

1. Repeated temporal measurements of all observable objects in the sky,
corresponding to thousands of observations per object over a 10-year
period, expected to generate 10,000-100,000 alerts each night - an alert 
is a signal (e.g., XML-formatted news feed) to the astronomical research 
community that something has changed at that location on the sky: either 
the brightness or position of an object, or the serendipitous appearance of 
some totally new object; 



2. Wide-angle imaging that will repeatedly cover most of the night sky within 
3 to 4 nights (= tens of billions of objects); and 

3. Deep co-added images of each observable patch of sky (summed over 10 
years: 2016-2026), reaching far fainter objects and to greater distance over 
more area of sky than other sky surveys [69].
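
To illustrate what an XML-formatted alert of the kind described in item 1 might carry, the sketch below builds a small alert record with Python's standard XML library. The element and attribute names form a hypothetical schema invented for this example; they are not the LSST alert format.

```python
import xml.etree.ElementTree as ET

def make_alert(alert_id, ra_deg, dec_deg, mjd, kind, delta_mag=None):
    """Serialize one sky event as XML (hypothetical schema, for illustration only)."""
    alert = ET.Element("alert", id=alert_id)
    # Where on the sky the change was seen (degrees) and when (Modified Julian Date).
    ET.SubElement(alert, "position", ra=f"{ra_deg:.5f}", dec=f"{dec_deg:.5f}")
    ET.SubElement(alert, "time", mjd=f"{mjd:.5f}")
    # What changed: "brightness", "position", or "new_object".
    change = ET.SubElement(alert, "change", kind=kind)
    if delta_mag is not None:
        change.set("delta_mag", f"{delta_mag:+.2f}")
    return ET.tostring(alert, encoding="unicode")

xml_text = make_alert("example-0001", 150.11923, 2.20581, 57500.25, "brightness", -1.3)
print(xml_text)
```

A subscriber would parse such a packet, decide from its position, time, and type of change whether it is of interest, and then query other archives for context.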

Compared to other astronomical sky surveys, the LSST survey will deliver 
time domain coverage for orders of magnitude more objects. It is envisioned 
that this project will produce ~30 TB of data per night of observation for
10 years. The final image archive will be ~70 PB (and possibly much more),
and the final LSST astronomical object catalog (object-attribute database) is
expected to be ~10-20 PB.

LSST's most remarkable data product will be a 10-year "movie" of the entire
sky = "Cosmic Cinematography". This time-lapse coverage of the night sky
will open up time-domain astronomy like no other project has been able to do 
previously. In general, astronomers have a good idea of what things in the sky 
are varying and what things are not varying, as a result of many centuries of 
humans staring at the sky, with and without the aid of telescopes. But, there is 
so much more possibly happening that we are not aware of at the very faintest 
limits simply because we have not explored the sky systematically night after 
night on a large scale. When an unusual time-dependent event occurs in the sky 
(e.g., a gamma-ray burst, supernova, or in-coming asteroid), astronomers (and 
others) will not only want to examine spatial coincidences of this object within 
the various surveys, but they will also want to search for other data covering 
that same region of the sky that were obtained at the same time as this new 
temporal event. These contextual data will enable more robust classification 
and characterization of the temporal event. Because of the time-criticality and 
potential for huge scientific payoff of such follow-up observations of transient 
phenomena, the classification system must also be able to perform time-based 
searches very efficiently and very effectively (i.e., to search all of the distributed 
VO databases as quickly as possible). One does not necessarily know in advance 
if such a new discovery will appear in any particular waveband, and so one will 
want to examine all possible astronomical sky surveys for coincidence events. 
Most of these "targets of opportunity" will consequently be added immediately 
to the observing programs of many ground-based and space-based astronomical 
telescopes, observatories, and on-going research experiments worldwide. 
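
The combined spatial and temporal search described above can be sketched as a filter over archival records: keep those that lie within a small angle of the new event and within a time window around it. The record layout, field names, and tolerances below are assumptions made for illustration; a production system would use indexed database queries rather than a linear scan.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation via the haversine formula; inputs and output in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp guards against rounding pushing sqrt(h) slightly above 1.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def coincident(event, records, max_sep_deg=0.01, max_dt_days=1.0):
    """Return records close to the event in both sky position and time."""
    return [r for r in records
            if abs(r["mjd"] - event["mjd"]) <= max_dt_days
            and angular_separation_deg(event["ra"], event["dec"],
                                       r["ra"], r["dec"]) <= max_sep_deg]

event = {"ra": 10.0, "dec": -5.0, "mjd": 57500.0}
records = [
    {"name": "a", "ra": 10.001, "dec": -5.001, "mjd": 57500.2},  # near in space and time
    {"name": "b", "ra": 10.001, "dec": -5.001, "mjd": 57510.0},  # same place, too late
    {"name": "c", "ra": 11.0, "dec": -5.0, "mjd": 57500.1},      # same time, elsewhere
]
matches = coincident(event, records)
print([r["name"] for r in matches])
```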

4.2 The LSST Data-Intensive Science Challenge 

LSST is not alone. It is one (likely the biggest one) of several large astronomi- 
cal sky survey projects beginning operations now or within the coming decade. 
LSST is by far the largest undertaking, in terms of duration, camera size, depth 
of sky coverage, volume of data to be produced, and real-time requirements on 
operations, data processing, event-modeling, and follow-up research response. 
One of the key features of these surveys is that the main telescope facility will be
dedicated to the primary survey program, with no specific plans for follow-up
observations. This is emphatically true for the LSST project [52]. Paradoxically,
the follow-up observations are scientifically essential - they contribute
significantly to new scientific discovery, to the classification and characterization of
new astronomical objects and sky events, and to rapid response to short-lived 
transient sky phenomena. 

Since it is anticipated that LSST will generate many thousands (probably 
tens of thousands) of new astronomical event alerts per night of observation, 
there is a critical need for innovative follow-up procedures. These procedures 
necessarily must include modeling of the events - to determine their classifica- 
tion, time-criticality, astronomical relevance, rarity, and the scientifically most 
productive set of follow-up measurements. Rapid time-critical follow-up obser- 
vations, with a wide range of time scales from seconds to days, are essential for 
proper identification, classification, characterization, analysis, interpretation, 
and understanding of nearly every astrophysical phenomenon (e.g., supernovae,
novae, accreting black holes, microquasars, gamma-ray bursts, gravitational mi- 
crolensing events, extrasolar planetary transits across distant stars, new comets, 
incoming asteroids, trans-Neptunian objects, dwarf planets, optical transients, 
variable stars of all classes, and anything that goes "bump in the night"). 
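
One simple way to turn such event modeling into a follow-up schedule is to score each event and observe the highest-scoring ones first. The sketch below combines three of the factors named above (classification confidence, rarity, and time-criticality) into a toy score; the scoring formula, field names, and numbers are hypothetical and stand in for whatever model a real broker would use.

```python
import heapq

def followup_score(event):
    """Toy priority: rarer, better-classified, faster-fading events score higher."""
    urgency = 1.0 / max(event["hours_remaining"], 0.1)  # floor avoids division by zero
    return event["rarity"] * event["class_prob"] * urgency

def top_targets(events, n):
    """Return the n highest-scoring events without sorting the full list."""
    return heapq.nlargest(n, events, key=followup_score)

events = [
    {"name": "variable star", "rarity": 0.1, "class_prob": 0.95, "hours_remaining": 240.0},
    {"name": "supernova candidate", "rarity": 0.6, "class_prob": 0.80, "hours_remaining": 48.0},
    {"name": "fast transient", "rarity": 0.9, "class_prob": 0.60, "hours_remaining": 0.5},
]
best = top_targets(events, 2)
print([e["name"] for e in best])
```

Selecting only the top few with a heap avoids a full sort, which matters when tens of thousands of events arrive each night.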

4.3 Petascale Data Mining with the LSST 

LSST and similar large sky surveys have enormous potential to enable countless 
astronomical discoveries. Such discoveries will span the full spectrum of statis- 
tics: from rare one-in-a-billion (or one-in-a-trillion) type objects, to a complete 
statistical and astrophysical specification of a class of objects (based upon
millions of instances of the class). One of the key scientific requirements of these
projects therefore is to learn rapidly from what they see. This means: (a) to 
identify the serendipitous as well as the known; (b) to identify outliers (e.g., 
"front-page news" discoveries) that fall outside the bounds of model expecta- 
tions; (c) to identify rare events that our models say should be there; (d) to 
find new attributes of known classes; (e) to provide statistically robust tests of 
existing models; and (f) to generate the vital inputs for new models. All of this 
requires integrating and mining of all known data: to train classification models 
and to apply classification models. 

LSST alone is likely to throw such data mining and knowledge discovery 
efforts into the petascale realm. For example: astronomers currently discover 
~100 new supernovae (exploding stars) per year. Since the beginning of human 
history, perhaps ~10,000 supernovae have been recorded. The identification,
classification, and analysis of supernovae are among the key science requirements 
for the LSST Project to explore Dark Energy - i.e., supernovae contribute to 
the analysis and characterization of the ubiquitous cosmic Dark Energy. Since 
supernovae are the result of a rapid catastrophic explosion of a massive star, it 
is imperative for astronomers to respond quickly to each new event with rapid 
follow-up observations in many measurement modes (light curves; spectroscopy; 
images of the host galaxy's environment). Historically, with <10 new supernovae
being discovered each week, such follow-up has been feasible. But now, LSST
promises to produce a list of 1000 new supernovae each night for 10 years [69],
which represent a small fraction of the total (10-100 thousand) alerts expected
each night! Astronomers are faced with the enormous challenge of efficiently 
mining, correctly classifying, and intelligently prioritizing a staggering number 
of new events for follow-up observation each night for a decade. 
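
A short back-of-the-envelope calculation shows what these rates mean for a classifier that must keep pace with the telescope. The alert range and the 20-second cadence are the figures quoted in this paper; the assumed 10 hours of observing per night is an illustrative value, not a project specification.

```python
def images_per_night(night_hours, cadence_s=20.0):
    """Number of exposures in one night at one image every `cadence_s` seconds."""
    return night_hours * 3600.0 / cadence_s

def alerts_per_image(alerts_per_night, night_hours):
    """Average number of alerts that each new image contributes."""
    return alerts_per_night / images_per_night(night_hours)

NIGHT_HOURS = 10.0  # assumption: usable observing time per night, not from the text

low = alerts_per_image(10_000, NIGHT_HOURS)    # lower end of the quoted alert range
high = alerts_per_image(100_000, NIGHT_HOURS)  # upper end of the quoted alert range
print(f"{images_per_night(NIGHT_HOURS):.0f} images per night")
print(f"{low:.1f} to {high:.1f} alerts per image, with a new image every 20 s")
```

Under this assumption, each image yields on the order of a handful to a few tens of alerts, all of which need a classification before the next image arrives if the system is not to fall behind.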

The major features and contents of the LSST scientific database include: 

• >100 database tables

• Image metadata = 675M rows

• Source catalog = 260B rows

• Object catalog = 22B rows, with 200+ attributes

• Moving Object catalog

• Variable Object catalog

• Alerts catalog

• Calibration metadata

• Configuration metadata

• Processing metadata

• Provenance metadata

Many possible scientific data mining use cases are anticipated with the LSST 
database, including: 

• Provide rapid probabilistic classifications for all 10,000 LSST events each 
night; 

• Find new "fundamental planes" of parameters (e.g., the fundamental plane 
of Elliptical galaxies); 

• Find new correlations, associations, relationships of all kinds from 100+ 
attributes in the science database; 

• Compute N-point correlation functions over a variety of spatial and astro- 
physical parameters; 

• Discover voids or zones of avoidance in multi-dimensional parameter spaces 
(e.g., period gaps); 

• Discover new and exotic classes of astronomical objects, while discovering 
new properties of known classes; 

• Discover new and improved rules for classifying known classes of objects 
(e.g., photometric redshifts); 

• Identify novel, unexpected behavior in the time domain from time series 
data of all known variable objects; 

• Hypothesis testing - verify existing (or generate new) astronomical hy- 
potheses with strong statistical confidence, using millions of training sam- 
ples; 



• Serendipity - discover the rare one-in-a-billion type of objects through 
outlier detection; and 

• Quality assurance - identify glitches, anomalies, image processing errors 
through deviation detection. 
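
As a rough illustration of the outlier-detection use cases listed above (serendipity and quality assurance), the following minimal sketch flags measurements whose robust z-score, built from the median and the median absolute deviation (MAD), exceeds a threshold. The function name, the 3.5 threshold, and the sample magnitudes are illustrative assumptions and are not taken from any LSST pipeline.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return indices of values whose robust z-score exceeds threshold.

    The robust z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. The 3.5 cut is a common rule of thumb,
    assumed here purely for illustration.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread: nothing can be called an outlier
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical magnitudes for one object across repeated visits.
magnitudes = [18.1, 18.0, 18.2, 18.1, 17.9, 18.0, 14.5, 18.1]
flagged = mad_outliers(magnitudes)  # index of the 14.5 measurement
```

A flagged value may be either a genuine rare event or a processing glitch; separating the two is exactly the kind of follow-up decision the broker is meant to support.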

Some of the data mining research challenge areas posed by the petascale 
LSST scientific database include: 

• scalability (at petabyte scales) of existing ML and data mining
algorithms;

• development of grid-enabled parallel data mining algorithms; 

• designing a robust system for brokering classifications from the LSST event 
pipeline; 

• multi-resolution methods for exploration of petascale databases; 

• visual data mining algorithms for visual exploration of the massive databases; 

• indexing of multi-attribute multi-dimensional astronomical databases (be- 
yond sky-coordinate spatial indexing); and 

• rapid querying of petabyte databases. 

5 A Classification Broker for Astronomy 

We are beginning to assemble user requirements and design specifications for an
ML engine (data integration network plus data mining algorithms) to address 
the petascale data mining needs of the LSST and other large data-intensive 
astronomy sky survey projects. The data requirements surpass those of the cur- 
rent Sloan Digital Sky Survey (SDSS, at www.sdss.org) by 1000-10,000 times, 
while the time-criticality requirement (for event/object classification and
characterization) drastically drops from months (or weeks) down to minutes (or
tens of seconds). In addition to the follow-up classification problem (described 
above), astronomers also want to find every possible new scientific discovery 
(pattern, correlation, relationship, outlier, new class, etc.) buried within these 
new enormous databases. This might lead to a petascale data mining compute 
engine that runs in parallel alongside the data archive, testing every possible 
model, association, and rule. We will focus here on the time-critical data mining 
engine (i.e., classification broker) that enables rapid follow-up science for the 
most important and exciting astronomical discoveries of the coming decade, on 
a wide range of time scales from seconds to days, corresponding to a plethora 
of exotic astrophysical phenomena. 



5.1 Broker Specifications: AstroDAS 



The classification broker's primary specification is to produce and distribute 
scientifically robust near-real-time classification of astronomical sources, events, 
objects, or event host objects (i.e., the astronomical object that hosts the event; 
e.g., the host galaxy for some distant supernova explosion - it is important to 
measure the redshift distance of the host galaxy in order to interpret and to clas- 
sify properly the supernova). These classifications are derived from integrating 
and mining data, information, and knowledge from multiple distributed data 
repositories. The broker feeds off existing robotic telescope and astronomical 
alert networks world-wide, and then integrates existing astronomical knowledge 
(catalog data) from the VO. The broker may eventually provide the knowledge 
discovery and classification service for LSST, a torrential fire hose of data and 
astronomical events. 

Incoming event alert data will be subjected to a suite of ML algorithms for 
event classification, outlier detection, object characterization, and novelty dis- 
covery. Probabilistic ML models will produce rank-ordered lists of the most sig- 
nificant and/or most unusual events. These ML models (e.g., Bayesian networks, 
decision trees, multiple weak classifiers, Markov models, or perhaps scientifically 
derived similarity metrics) will be integrated with astronomical taxonomies and 
ontologies that will enable rapid information extraction, knowledge discovery, 
and scientific decision support for real-time astronomical research facility oper- 
ations - to follow up on the 10-100K alertable astronomical events that will be 
identified each night for 10 years by the LSST sky survey. 
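
To make the idea of a rank-ordered list from a probabilistic model concrete, the following minimal sketch applies Bayes' rule to per-event class likelihoods and sorts the events by the posterior probability of a chosen class. All names, priors, and likelihood values are hypothetical; a real broker would obtain likelihoods from trained models rather than from a hand-written table.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: P(class | data) is proportional to P(data | class) * P(class)."""
    unnorm = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(unnorm.values())
    return {c: p / total for c, p in unnorm.items()}

def rank_events(events, priors, target):
    """Sort event ids by posterior probability of `target`, highest first."""
    scored = [(posterior(priors, lk)[target], eid) for eid, lk in events.items()]
    return [eid for score, eid in sorted(scored, reverse=True)]

# Hypothetical class priors and per-event likelihoods P(features | class).
priors = {"supernova": 0.1, "variable_star": 0.6, "asteroid": 0.3}
events = {
    "evt-001": {"supernova": 0.70, "variable_star": 0.05, "asteroid": 0.01},
    "evt-002": {"supernova": 0.02, "variable_star": 0.60, "asteroid": 0.10},
    "evt-003": {"supernova": 0.30, "variable_star": 0.30, "asteroid": 0.30},
}
ranking = rank_events(events, priors, "supernova")
```

Ranking by the posterior of a single target class is only one possible priority rule; ranking by how poorly an event fits every known class would instead surface the most unusual events.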

The classification broker will include a knowledgebase to capture the new 
labels (tags) that are generated for the new astronomical events. These tags 
are annotations to the events. "Annotation" refers to tagging the data and 
metadata content with descriptive terms. For this knowledgebase, we envision a 
collaborative tagging system, called AstroDAS (Astronomy Distributed Annota- 
tion System) [24]. AstroDAS is similar to existing science knowledgebases, such 
as BioDAS (biodas.org), WikiProteins (www.wikiprofessional.info), the
Heliophysics Knowledgebase (HPKB; www.lmsal.com/helio-informatics/hpkb/), and
The Entity Describer [IT]. AstroDAS is "distributed" in the sense that the 
source data and metadata are distributed, and the users are distributed. "An- 
notation" refers to tagging the data and metadata content with descriptive 
terms, which apply to individual data granules or to subsets of the data. It is a 
"system" with a unified schema for the annotation database, where distributed 
data are perceived as a unified data system to the user. One possible imple- 
mentation of AstroDAS could be as a Web 2.0 scientific data and information 
mashup (=Science2.0). AstroDAS users will include providers (authors) and an- 
notation users (consumers). Consumers (humans or machines) will eventually 
interact with AstroDAS in four ways: 

1. Integrate the annotation database content within their own data portals, 
providing scientific content to their own communities of users. 

2. Subscribe to receive notifications when new sources are annotated or
classified.

3. Use the classification broker as a data integration tool to broker classes and 
annotations between sky surveys, robotic telescopes, and data repositories. 

4. Query the annotation database (either manually or through web services). 

In the last case, the users include the astronomical event message producers, 
who will want to issue their alerts with their best-estimate for the astronom- 
ical classification of their event. The classification will be generated through 
the application of ML algorithms to the networked data accessible via the VO, 
in order to arrive at a prioritized list of classes, ordered by probability.
In order to facilitate these science use cases (and others not listed here),
AstroDAS must have the following features: (a) it must enable collaborative, 
dynamic, distributed sharing of annotations; (b) it must access databases, data 
repositories, grids, and web services; (c) it must apply ontologies, semantics, 
dictionaries, annotations, and tags; and (d) it must employ data/text mining, 
ML, and information extraction algorithms. 
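
The annotate, subscribe, and query interactions described above can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: the class names, field names, and example tags are hypothetical and do not describe the actual AstroDAS schema, which would sit on top of distributed databases and web services.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    source_id: str  # identifier of the annotated object or event
    tag: str        # descriptive term, e.g. a class label
    author: str     # human user or machine classifier that supplied the tag

class AnnotationStore:
    """Toy in-memory stand-in for an annotation database."""

    def __init__(self):
        self._by_source = defaultdict(list)
        self._subscribers = []

    def subscribe(self, callback):
        """Register a callable to be notified of each new annotation."""
        self._subscribers.append(callback)

    def annotate(self, source_id, tag, author):
        """Record a tag for a source and notify all subscribers."""
        note = Annotation(source_id, tag, author)
        self._by_source[source_id].append(note)
        for callback in self._subscribers:
            callback(note)
        return note

    def query(self, source_id):
        """Return all annotations recorded for one source."""
        return list(self._by_source.get(source_id, []))

store = AnnotationStore()
received = []                      # a consumer that just collects notifications
store.subscribe(received.append)
store.annotate("evt-001", "supernova candidate", "classifier-A")
store.annotate("evt-001", "host galaxy identified", "astronomer-B")
```

Recording the author with every tag keeps human and machine annotations distinguishable, which matters for provenance when the two disagree.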

5.2 Collaborative Annotation of Classes 

Machine learning and data mining algorithms, when applied to very large data 
streams, could possibly generate the classification labels (tags) autonomously. 
Generally, scientists do not want to leave this decision-making to machine in- 
telligence alone - they prefer to have human intelligence in the loop also. When 
humans and machines work together to produce the best possible classification 
label(s), this is collaborative annotation. Collaborative annotation is a form 
of Human Computation [75]. Human Computation refers to the application of
human intelligence to solve complex difficult problems that cannot be solved by 
computers alone. Humans can see patterns and semantics (context, content, and 
relationships) more quickly, accurately, and meaningfully than machines. Hu- 
man Computation therefore applies to the problem of annotating, labeling, and 
classifying voluminous data streams. Of course, the application of autonomous 
machine intelligence (data mining and ML) to the annotation, labeling, and 
classification of data granules is also valid and efficacious. The combination of 
both human and machine intelligence is critical to the success of AstroDAS as a 
classification broker for enormous data-intensive astronomy sky survey projects, 
such as LSST. 
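
One simple way to picture the combination of human and machine intelligence is a weighted blend of a classifier's class probabilities with the fraction of human votes for each class. The sketch below is an assumption made for illustration only; the linear blend, the 0.5 weight, and the example values are not a method proposed in this paper.

```python
from collections import Counter

def combine_labels(machine_probs, human_votes, human_weight=0.5):
    """Blend machine class probabilities with the fraction of human votes.

    The linear blend and the 0.5 weight are illustrative assumptions. With
    no human votes, the machine probabilities are returned unchanged.
    """
    if not human_votes:
        return dict(machine_probs)
    counts = Counter(human_votes)
    n = len(human_votes)
    classes = set(machine_probs) | set(counts)
    return {c: (1 - human_weight) * machine_probs.get(c, 0.0)
               + human_weight * counts[c] / n
            for c in classes}

# Hypothetical classifier output and human votes for one event.
machine = {"supernova": 0.55, "variable_star": 0.45}
votes = ["supernova", "supernova", "variable_star", "supernova"]
blended = combine_labels(machine, votes)
best = max(blended, key=blended.get)
```

Here a marginal machine classification is sharpened by a three-to-one human vote; the weight controls how much either side can overrule the other.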

5.3 A Research Agenda 

We identify some of the key research activities that must be addressed, in order 
to promote the development of an ML-based classification broker for petascale
mining of large-scale astronomy sky survey databases. Many of these research 
activities are already being pursued by other data mining and computational 
science researchers - we hope to take advantage of all such developments, many
of which are enabled through advanced next-generation data mining and
cyberinfrastructure research:

1. Before the classification labels can be useful, we must reach community 
consensus on the correct set of semantic ontological, taxonomical, and 
classification terms. There are ontologies under development in astronomy 
already - their completeness, utility, and usability need to be researched. 

2. Research into user requirements and scientific use cases will be required 
in order that we design, develop, and deploy the correct user-oriented 
petascale data mining system. 

3. A complete set of classification rules must be researched and derived for 
all possible astronomical events and objects. For objects and events that 
are currently unknown, we need to identify robust outlier and novelty 
detection rules and classifiers. These need to be researched and tested. 

4. We need to research and collect comprehensive sets of training examples 
for the numerous classes that we hope to classify. With these samples, the 
classification broker will be trained and validated. 

5. Algorithms for web services-based (perhaps grid-based or peer-to-peer) 
classification and mining of distributed data must be researched, developed,
and validated. These mining algorithms should include text mining as well as
numeric data mining; an integrated text-numeric data mining approach may be
most effective and thus needs to be researched.

6. User interface and interaction models will need to be researched through 
prototypes and demonstrations of the classification broker. 

7. Research into the robust integration of the many AstroDAS system com- 
ponents will be needed. This will require investigation of different modes 
of interaction and integration, such as grids, web services, RSS feeds, on- 
tologies (expressed in RDF or OWL), linked databases, etc. 

8. Deploy a working classification broker on a live astronomical event message 
stream, to research its functionality, usefulness, bottlenecks, failure modes, 
security, robustness, and (most importantly) scalability (from the current 
few events per night, up to many tens of thousands of events per night 
in the coming decade). Fortunately, there are such event message feeds 
available today, though on a much smaller scale than that anticipated from 
LSST. 

Clearly, this is an ambitious research agenda. It will not be fully accom- 
plished in just a year or two. It will require several years of research and devel- 
opment. This is fortunate, since the most dramatic need for the classification 
broker system for astronomy will come with the start-up of LSST sky survey 
operations in 2016, lasting ten years (until 2026). So, we have a few years to get 
it right, and we will need all of those years to complete the challenging research 
program described above. 



6 Introducing the New Science of Astroinformatics



As described above, today's astronomical research environment is highly focused 
on the design, implementation, and archiving of very large sky surveys. Many 
projects today (e.g., Palomar-Quest Synoptic Sky Survey [PQ], Sloan Digital 
Sky Survey [SDSS], and 2-Micron All Sky Survey [2MASS]) plus many more 
projects in the near future (e.g., LSST, Palomar Transient Factory [PTF], Su- 
pernova Acceleration Probe [SNAP], Panoramic Survey Telescope And Rapid 
Response System [Pan-STARRS], and Dark Energy Survey [DES]) are destined 
to produce enormous catalogs of astronomical sources. The virtual collection 
of these gigabyte, terabyte, and (eventually) petabyte catalogs will significantly 
increase science return and enable remarkable new scientific discoveries through 
the integration and cross-correlation of data across these multiple survey dimen- 
sions. Astronomers will be unable to tap the riches of this data lode without 
a new paradigm for astroinformatics that involves distributed database queries 
and data mining across distributed virtual tables of de-centralized, joined, and 
integrated sky survey catalogs. The challenges posed by this problem are daunt- 
ing, as in most disciplines today that are producing data floods at prodigious 
rates. 

The development and deployment of the astronomy Virtual Observatory 
(VO) is perceived by some as the solution to this problem. The VO pro- 
vides one-stop shopping for all end-user data needs, including access to dis- 
tributed heterogeneous data, services, and other resources (e.g., the GRID). 
Some grid-based data mining services are already envisioned or in development 
(e.g., GRIST at http://grist.caltech.edu/, the Datamining Grid, and F-MASS
at http://www.itsc.uah.edu/f-mass/). However, processing and mining the
associated distributed and vast data collections are fundamentally challenging
since most off-the-shelf data mining systems require the data to be downloaded 
to a single location before further analysis. This imposes serious scalability 
constraints on the data mining system and fundamentally hinders the scientific 
discovery process. If distributed data repositories are to be really accessible 
to a larger community, then technology ought to be developed for supporting 
distributed data analysis that can reduce, as much as possible, communication 
requirements. 

The new science of astroinformatics will emerge from this large and expand- 
ing distributed heterogeneous data environment. We define astroinformatics as 
the formalization of data-intensive astronomy for research and education [19] [20].
Astroinformatics will borrow heavily from concepts in the fields of bioinformat- 
ics and geoinformatics (i.e., GIS = Geographic Information Systems). The 
main features of this new science are: it is data-driven, data-centric, and data- 
inspired. As bioinformatics represents an entirely new paradigm for research in 
the biological sciences, beyond computational biology, so also does astroinfor- 
matics represent a new mode of data-intensive scientific research in astronomy 
that is cognizant of and dependent on the astronomical flood of data
that is now upon us. Data mining and knowledge discovery will become the
killer apps for this mode of scientific research and discovery. Scientific databases 
will be the "virtual sky" that astronomers will study and mine. New scientific 
understanding will flow from the discovered knowledge, which is derived from 
the avalanche of information content, which is extracted from the massive data 
collections. 

6.1 Distributed Scientific Data Mining 

Distributed data mining (DDM) of large scientific data collections will become 
the norm in astronomy, as the data collections (from the numerous large sky 
surveys) become so large that they cannot all be downloaded to a central site 
for mining and analysis. DDM algorithms will be an essential tool to enable 
discovery of the hidden knowledge buried among geographically dispersed het- 
erogeneous databases [IH [15l [50l [391 E9 . 

As an example of the potential astronomical research that DDM will enable, 
we consider the large survey databases being produced (now and in the near 
future) by various NASA missions. GALEX is producing all-sky surveys at a 
variety of depths in the near-UV and far-UV. The Spitzer Space Telescope is 
conducting numerous large-area surveys in the infrared, including regions of sky 
(e.g., the Hubble Deep Fields) that are well studied by the Hubble Space Tele- 
scope (optical), Chandra X-ray Observatory, and numerous other observatories. 
The WISE mission (to be launched circa 2009) will produce an all-sky infrared 
survey. The 2-Micron All-Sky Survey (2MASS) has catalogued millions of stars 
and galaxies in the near-infrared. Each of these wavebands contributes valuable 
astrophysical knowledge to the study of countless classes of objects in the as- 
trophysical zoo. In many cases, such as the young star-forming regions within 
starbursting galaxies, the relevant astrophysical objects and phenomena have 
unique characteristics within each wavelength domain. For example, starburst- 
ing galaxies are often dust-enshrouded, yielding enormous infrared fluxes. Such 
galaxies reveal peculiar optical morphologies, occasional X-ray sources (such as 
intermediate-mass black holes), and possibly even some UV bright spots as the
short-wavelength radiation leaks through holes in the obscuring clouds. All of these
data, from multiple missions in multiple wavebands, are essential for a full char- 
acterization, classification, analysis, and interpretation of these cosmologically 
significant populations. 

In order to reap the full potential of scientific data mining, analysis, and 
discovery that this distributed data environment enables, it is essential to bring 
together data from multiple heterogeneously distributed data sites. For the 
all-sky surveys in particular (such as 2MASS, WISE, GALEX, SDSS, LSST), 
it is impossible to access, mine, navigate, browse, and analyze these data in 
their current distributed state. To illustrate this point, suppose that an all- 
sky catalog contains descriptive data for one billion objects; and suppose that 
these descriptive data consist of a few hundred parameters (which is typical for 
the 2MASS and Sloan Digital Sky Surveys). Assuming simply that each
parameter requires just a 2-byte representation, each survey database will
consume roughly one terabyte of space. If the survey also has a temporal dimension (such
as the LSST, which will re-image each object 1000-2000 times), then massively 
more data handling is required in order to mine the enormous potential of the 
database contents. If each of these catalog entries and attributes requires only 
one CPU cycle to process it (e.g., in a data mining operation), then many 
teraflops (up to petaflops) of computation will be required even for the simplest 
data mining application on the full contents of the databases. 
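
The size estimate above can be checked with a few lines of arithmetic. In this sketch "a few hundred parameters" is taken to be 500, and the temporal case assumes that every parameter is stored at each of 1000 epochs; both figures are assumptions chosen to reproduce the order of magnitude, and the second is an upper bound rather than a design figure.

```python
def catalog_bytes(n_objects, n_params, bytes_per_param=2, n_epochs=1):
    """Raw catalog size: objects x parameters x bytes per parameter x epochs."""
    return n_objects * n_params * bytes_per_param * n_epochs

TB = 10**12  # decimal terabyte

# One billion objects, 500 parameters (assumed), 2 bytes each.
static = catalog_bytes(10**9, 500)
# Same catalog if every parameter were kept for 1000 re-imagings (assumed).
temporal = catalog_bytes(10**9, 500, n_epochs=1000)

print(static / TB)    # 1.0
print(temporal / TB)  # 1000.0
```

Under these assumptions a static all-sky catalog occupies about one terabyte, while a time-domain survey of the same objects approaches a petabyte.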

It is clearly infeasible, impractical, and impossible to drag these terabyte 
(and soon, petabyte) catalogs back and forth from user to user, from data 
center to data center, from analysis package to package, each time someone has 
a new query to pose against these various data collections. Therefore, there 
is an urgent need for novel DDM algorithms that are inherently designed to 
work on distributed data collections. We are consequently focusing our research 
efforts on these problems [39] [35].
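
The core idea of mining data in place can be illustrated with the simplest possible case: a global mean and variance computed from small per-site summaries, so that only three numbers per site cross the network instead of the catalog itself. This is a minimal sketch with hypothetical function names and toy data, and it does not represent the DDM algorithms of the cited work.

```python
def local_summary(values):
    """Computed at each data site: count, sum, and sum of squares."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge_summaries(summaries):
    """Computed centrally: global mean and population variance."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance

# Toy values held at two separate (hypothetical) data centers.
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]
mean, variance = merge_summaries([local_summary(site_a), local_summary(site_b)])
```

The sum-of-squares form is used here for brevity and can lose numerical precision on real data; the point is only that the communication cost is independent of the catalog size.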

6.2 Beyond the Science 

Before we conclude, it is important to mention how these scientific data mining 
concepts are also relevant to science, mathematics, and technical education in 
our society today [30]. The concept of "Using Data in the Classroom" is developing
quite an appeal among inquiry-based learning proponents². Astronomy data
and images in particular have a special universal appeal to students, general pub- 
lic, and all technical experts. Student-led data mining projects that access large 
astronomical databases may lead to discoveries of new comets, asteroids, explod- 
ing stars, and more. Members of both the LSST and the NVO project scientific 
teams are especially interested in this type of collaboration among scientists, 
data mining experts, educators, and students. The classroom activities (in- 
volving "cool astronomy data" ) are engaging and exciting to students and thus 
contribute to the overall scientific, technical, and mathematical literacy of the 
nation. Astroinformatics enables transparent data sharing, reuse, and analysis 
in inquiry-based science classrooms. This allows not only scientists, but also stu- 
dents, educators, and citizen scientists to tackle knowledge discovery problems 
in large astronomy databases for fun and for real. This integrated research and 
education activity matches well to the objectives of the new CODATA ADMIRE 
(Advanced Data Methods and Information technologies for Research and Educa- 
tion) initiative (www.iucr.org/iucr-top/data/docs/codataga2006_beijing.html). 
Students are trained: (a) to access large distributed data repositories; (b) to 
conduct meaningful scientific inquiries into the data; (c) to mine and analyze 
the data; and (d) to make data-driven scientific discoveries [23].

6.3 Informatics for Scientific Knowledge Discovery 

Finally, we close with discussions of BioDAS (the inspiration behind AstroDAS) 
and of the relevance of informatics (e.g., Bioinformatics and Astroinformatics)
to the classification broker described earlier. Informatics is the discipline
of organizing, accessing, mining, analyzing, and visualizing data for scientific
discovery. Another definition says informatics is the set of methods and
applications for integration of large datasets across spatial and temporal scales
to support decision-making, involving computer modeling of natural systems,
heterogeneous data structures, and data-model integration as a framework for
decision-making.

2 http://serc.carleton.edu/usingdata/

Massive scientific data collections impose enormous challenges to scientists: 
how to find the most relevant data, how to reuse those data, how to mine data 
and discover new knowledge in large databases, and how to represent the newly 
discovered knowledge. The bioinformatics research community is already solv- 
ing these problems with BioDAS (Biology Distributed Annotation System). The 
DAS provides a distributed system for researchers anywhere to annotate (mark- 
up) their own knowledge (tagged information) about specific gene sequences. 
Any other researcher anywhere can find this annotation information quickly for 
any gene sequence. Similarly, astronomers can annotate individual astronom- 
ical objects with their own discoveries. These annotations can be applied to 
observational data/metadata within distributed digital data collections. The 
annotations provide mined knowledge, class labels, provenance, and semantic 
(scientifically meaningful) information about the experiment, the experimenter, 
the object being studied (astronomical object in our case, or gene sequence in 
the case of the bioinformatics research community), the properties of that ob- 
ject, new features or functions discovered about that object, its classification, 
its connectedness to other objects, and so on.

Bioinformatics (for biologists) and Astroinformatics (for astronomers) pro- 
vide frameworks for the curation, discovery, access, interoperability, integration, 
mining, classification, and understanding of digital repositories through (hu- 
man plus machine) semantic annotation of data, information, and knowledge. 
We are focusing new research efforts on further development of Astroinformat- 
ics as: (1) a new subdiscipline of astronomical research (similar to the role 
of bioinformatics and geoinformatics as stand-alone subdisciplines in biological 
and geoscience research and education, respectively); and (2) the new paradigm 
for data-intensive astronomy research and education [THl __U], which focuses on
existing cyberinfrastructure such as the astronomical Virtual Observatory. 
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